Apparatus and method for scheduling threads in multi-threading processors

ABSTRACT

A multi-threading processor is provided. The multi-threading processor includes a first instruction fetch unit to receive a first thread and a second instruction fetch unit to receive a second thread. A multi-thread scheduler is coupled to the instruction fetch units and an execution unit. The multi-thread scheduler determines the width of the execution unit and the execution unit executes the threads accordingly.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates generally to increasing utilization and overall performance in multi-threading microprocessors. More particularly, the present invention relates to more effectively scheduling threads to optimize a wide in-order processor.

[0003] 2. Description of the Related Art

[0004] In a conventional computer system, microprocessors run several different processes. The computer system utilizes an operating system (OS) to direct the microprocessor to run each of the processes based on priority and on the process not waiting on an event (e.g., a disk access or a user keypress) to continue. The simplest type of priority system merely directs the OS to run the programs in sequence (i.e., the last program to be run has the lowest priority). In other systems, the priority of a program may be assigned based on other factors, such as the importance of the program, how efficient it is to run the program, or both. Through priority, the OS is then able to determine the order in which programs or software threads or contexts are executed by the processor. It takes a significant amount of time, typically more than the time required to execute several hundred instructions, for the OS to switch from one running process to another running process.

[0005] Because of the overhead incurred from each process switch, the OS will only switch out a process when it knows the process will not be ready to run again for a significant amount of time. However, with the increasing speed of processors, there are events that make the process unexecutable for an amount of time that is not long enough to justify an OS-level process switch. When the program is stalled by such an event, such as a cache miss (e.g., when a long latency memory access is required), the processor experiences idle cycles for the duration of the stalling event, decreasing the overall system performance. Because newer and faster processors are always being developed, the number of idle cycles experienced by processors is also increasing. Although memory access speed is also being improved, it has not increased at the same rate as microprocessor speeds; therefore, processors are spending an increasing percentage of time waiting for memory to respond.

[0006] Recent developments in processor design have allowed for multi-threading, where two or more distinct threads are able to make use of available processor resources. A Simultaneous Multi-Threading (SMT) microprocessor allows multiple threads to share and to compete for processor resources at the same time. The threads are scheduled concurrently and therefore operations from all of the threads progress down the pipeline simultaneously. If a thread in a SMT system is stalled and waiting for memory, the other threads will continue execution, thus allowing the SMT system to continue executing useful work during a cache miss.

[0007] Because multiple threads are able to issue instructions during each cycle, a SMT system typically results in a dramatic increase in system throughput. However, the performance improvement is subject to certain boundary conditions. The effectiveness of SMT decreases as the number of threads increases because the underlying machine resources are limited and because of the exponential cost increase of inspecting and tracking the status of each additional thread.

[0008] A major problem with scheduling threads in a SMT system occurs when developers attempt to build a SMT system with an in-order machine rather than an out of order machine. As with any threads in any single-threaded system, the instructions to be executed in a SMT system must be given an order of execution, determined by whether a particular instruction is dependent on another. For example, if a second instruction depends on a result from a first instruction, the processor must execute instruction one before executing instruction two.

[0009] An out of order machine includes built in hardware that determines whether or not instructions in a thread are dependent on the result of another instruction. If two threads are independent of each other, it is unnecessary to coordinate their scheduling of execution relative to each other. However, if an instruction is dependent upon another, then the out of order machine schedules the dependent instruction to be executed after the instruction on which it depends. After examining many instructions, the out of order machine is able to create chains of dependencies for the processor within its execution profile. Because the two threads are always independent in a SMT system, the existing hardware in the out of order machine may be extended to schedule the threads to execute in parallel.

[0010] An in-order machine does not include hardware to determine instruction dependency. Instead, instructions are simply presented in memory in the same order that the compiler or program places them. Therefore, the instructions must be executed in the same exact order that they were placed into memory. Because in-order machines cannot determine the dependency of each instruction, an in-order machine is not able to properly reorder instructions from different threads in a SMT system. An additional in-order scheduling problem arises when the processor is not wide enough and does not have the bandwidth to execute the multiple threads in parallel.

[0011] While SMT systems are able to process more than two threads simultaneously (some developers have tried to schedule as many as eight threads at a time), each additional thread requires an increase in machine cost. For example, a large parallel logic array (PLA) may be required to coordinate and schedule all of the threads if a SMT system is complex enough. Therefore, it is often not an efficient use of processing power to execute more than two threads at the same time. Furthermore, such additional overhead is often completely unwarranted because few machines are wide enough or have the resources to support more than two active threads.

[0012] In view of the foregoing, it is desirable to have a method and apparatus that provides for a system able to maximize the use of wide processor resources in an in-order machine. In particular, it is desirable to have an in-order SMT system because in-order machines are simpler than out of order machines, thereby conserving valuable chip space, consuming less power, and generating less heat. It is also desirable to have an in-order SMT system with minimal circuit impact.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements.

[0014] FIG. 1 illustrates a multithreading system in accordance with one embodiment of the present invention.

[0015] FIG. 2 illustrates the in-order multi-threading processor in accordance with one embodiment of the present invention.

[0016] FIG. 3 is a flow chart of a method for scheduling threads for an in-order multithreading processor in accordance with one embodiment of the present invention.

[0017] FIG. 4 illustrates two threads being executed in the bandwidth of an in-order multithreading processor in accordance with one embodiment of the present invention.

DETAILED DESCRIPTION

[0018] A method and apparatus for a multi-threading computer system to efficiently schedule threads in a wide in-order processor is provided. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be understood, however, by one skilled in the art, that the present invention may be practiced without some or all of these specific details. In other instances, well known process operations have not been described in detail in order not to unnecessarily obscure the present invention.

[0019] In general, to improve the performance of a microprocessor, the number of transistors that must fit onto a single chip die must be increased. Therefore, the spatial constraint of a single semiconductor chip is perhaps the greatest limiting factor in the speed of a microprocessor and other forms of chips. Developers and engineers constantly strive to find novel means to fit more transistors onto a chip die. For example, the advent of 0.13 micron semiconductor design and fabrication is specifically intended to form smaller patterns and features in a chip. The technology would then allow even more transistors and other circuitry to be placed within the spatial constraint of a single chip.

[0020] Overcoming the spatial limitations of a semiconductor chip will only become more important in future generations of processors; therefore, research is always ongoing to deal with the spatial limitations of the future. Processor designs that conserve and efficiently utilize space on a chip will become increasingly advantageous over processor designs that do not. Therefore, the greatest advantage of an in-order machine over an out of order machine is simplicity of design. Because an out of order machine is much more complex, it requires a much larger number of transistors and much more space.

[0021] For example, in the Intel processor family, the Pentium processor is an in-order machine with approximately three million transistors. By comparison, the Pentium Pro, which is an out of order machine, uses about six and a half million transistors, requiring much more space than the in-order Pentium. Because of the additional transistors, the Pentium Pro also requires more power and generates more heat. The present invention takes advantage of the existing space conserving design of the in-order machine, which is used in Intel's Itanium Processor Family (IPF), by enabling the in-order platform to support a multithreading processor.

[0022] FIG. 1 is an illustration of a multi-threading computer system 10 in accordance with one embodiment of the present invention. Multi-threading computer system 10 includes an in-order multi-threading processor 12 that is coupled to a memory module 14 and a mass storage device 15. In-order multi-threading processor 12 is preferably a SMT processor. Memory module 14 is typically a form of random access memory (RAM), such as synchronous dynamic RAM (SDRAM) or Rambus Dynamic RAM (RDRAM). Examples of mass storage device 15 include hard disk drives, floppy drives, optical drives, and tape drives. In multi-threading system 10, programs are loaded from mass storage device 15 into memory module 14 and then executed by in-order multi-threading processor 12.

[0023] In-order multi-threading processor 12 must execute instructions in the order the instructions were entered into memory module 14. Therefore, unlike an out of order processor, in-order multi-threading processor 12 is unable to create independent chains of execution necessary to extract instruction level parallelism (ILP) from a single thread. To determine the dependencies of each of the instructions from the multiple threads, multithreading computer system 10 relies on a specialized multi-thread scheduler and a compiler to identify sets of independent instructions and logic to schedule the threads.

[0024] FIG. 2 illustrates in-order multi-threading processor 12 in accordance with one embodiment of the present invention. In-order multi-threading processor 12 includes a pair of instruction fetch units 16 and 18 for thread 1 and thread 2, respectively. Each of the instruction fetch units (IFU) 16 and 18 is uni-directionally coupled to a corresponding instruction decode unit (IDU) 20 or 22 for thread 1 or 2. IDUs 20 and 22 are then coupled to a multi-thread scheduler 24. In-order multi-threading processor 12 also includes an execution unit 26, which is coupled to multi-thread scheduler 24.

[0025] IFUs 16 and 18 read instructions from memory (such as an instruction cache) for threads 1 and 2. Each IFU functions to ensure that the processor has enough instruction bandwidth to sustain the highest possible instruction issue rate. IFUs also operate to predict future instruction sequences with a high degree of accuracy. The instructions are then transmitted to IDUs 20 and 22, which perform operations such as register renaming and initial dependency checks. IDUs also function to predict branch paths and compute target addresses for branch instructions.

[0026] Instructions from IDUs 20 and 22 are then transmitted to multi-thread scheduler 24. Multi-thread scheduler 24 takes into account the available local capacity and prioritizes the incoming instructions from both thread 1 and thread 2, with the goal of maximizing processor utilization. Multi-thread scheduler 24 therefore determines whether or not execution unit 26 is wide enough to execute thread 1 and thread 2 at the same time and subsequently decides whether to execute the threads in parallel or in series. Other examples of scheduling policies may include scheduling high load/store processes and low load/store processes together to yield better system utilization and performance.
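
The load/store pairing policy mentioned above might be sketched as follows. This is an illustration only: the function name `pair_by_load_store`, the use of a per-process load/store fraction as the ranking metric, and the heaviest-with-lightest pairing rule are hypothetical and are not taken from the description.

```python
def pair_by_load_store(ls_fraction):
    """Pair the most load/store-heavy process with the lightest one.

    ls_fraction maps a process name to the (assumed) fraction of its
    instructions that are loads or stores. Returns the list of pairs and
    the unpaired process, if the number of processes is odd.
    """
    ranked = sorted(ls_fraction, key=ls_fraction.get)  # lightest first
    pairs = []
    lo, hi = 0, len(ranked) - 1
    while lo < hi:
        # Co-schedule a heavy process with a light one so that they
        # contend less for the memory ports (assumed motivation).
        pairs.append((ranked[hi], ranked[lo]))
        lo += 1
        hi -= 1
    leftover = ranked[lo] if lo == hi else None
    return pairs, leftover
```

Called with the hypothetical fractions `{"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}`, the sketch pairs `a` with `d` and `c` with `b`.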

[0027] Typically, a programmer writes the program in a language such as Pascal, C++ or Java, which is stored in a file called the source code. The programmer then runs the appropriate language compiler to convert the source code into object code. The object code comprises machine language that the processor can execute one instruction at a time. In addition to generating object code, a compiler may support many other features to aid the programmer. Such features may include automatic allocation of variables, arbitrary arithmetic expressions, variable scope, input/output operations, higher-order functions and portability of source code.

[0028] In the present invention, the compiler explicitly describes blocks of independent operations to the in-order machine so that they may be executed in parallel. In contrast, a compiler for earlier machines was not capable of describing independent instructions. Instead, hardware was required to determine independent instructions at run time. Therefore, in the present invention, the task of generating instruction level parallelism is accomplished statically at compile time rather than dynamically at run time.
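
As a rough illustration of compile-time grouping, the sketch below splits an in-order instruction list into blocks whose members do not depend on one another. The representation of an instruction as `(dest_register, [source_registers])`, the function name `group_independent`, and the rule of starting a new block on a read-after-write or write-after-write conflict are assumptions made for this example; the description does not specify the compiler's actual algorithm.

```python
def group_independent(instructions):
    """Split an in-order instruction list into blocks with no internal dependencies.

    Each instruction is (dest_register, [source_registers]). Program order
    is preserved; a new block starts when an instruction reads or rewrites
    a register already written in the current block.
    """
    groups, current, written = [], [], set()
    for dest, sources in instructions:
        if dest in written or any(s in written for s in sources):
            groups.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


# Hypothetical four-instruction program.
program = [
    ("r1", ["r10", "r11"]),  # r1 = r10 + r11
    ("r2", ["r12", "r13"]),  # r2 = r12 + r13, independent of r1
    ("r3", ["r1", "r2"]),    # r3 = r1 + r2, needs both earlier results
    ("r4", ["r14"]),         # r4 = r14, independent of r3
]
widths = [len(group) for group in group_independent(program)]
```

Here the first two instructions form one block and the last two form another, so `widths` holds the number of instructions the machine would be asked to issue together in each block.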

[0029] This thread dispersal of the compiler of the present invention for in-order machines thus motivates the development of wide in-order machines that can execute many instructions simultaneously. In addition, to efficiently utilize the capabilities of a wide in-order machine, the machine must also be able to schedule multiple threads when the compiler cannot find enough ILP in a single thread to fully occupy the machine as described above with regard to multi-thread scheduler 24.

[0030] FIG. 3 is a flow chart of a method 28 for scheduling threads for an in-order multithreading processor in accordance with one embodiment of the present invention. Method 28 begins at a block 30 where thread 1 and thread 2 are fetched. A block 32 determines whether the in-order multi-threading processor is wide enough to execute both threads 1 and 2 in parallel. The widths of threads 1 and 2 are examined during each cycle and then compared to the width of the processor. If the in-order multi-threading processor is wide enough to execute all of the instructions in threads 1 and 2, then both threads 1 and 2 are executed in parallel in a block 34. If the in-order multi-threading processor is not wide enough to execute both threads, then the threads are executed in series in blocks 36 and 38.
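
The decision made in method 28 for one pair of fetched instruction groups can be sketched as below. The function name `method_28`, the measurement of width as a count of instructions, and the choice to run thread 1 before thread 2 in the serial case are assumptions for illustration; only the block numbers come from FIG. 3.

```python
def method_28(thread1_width, thread2_width, processor_width):
    """Return the issue plan for one pair of fetched instruction groups.

    Each entry of the returned list is one cycle, given as the tuple of
    thread numbers that issue in that cycle.
    """
    # Block 30: both threads have been fetched; their widths are the inputs.
    # Block 32: compare the combined width with the processor width.
    if thread1_width + thread2_width <= processor_width:
        # Block 34: both threads execute in parallel, in the same cycle.
        return [(1, 2)]
    # Blocks 36 and 38: the threads execute in series, one per cycle.
    return [(1,), (2,)]
```

For a hypothetical processor width of six, `method_28(3, 2, 6)` yields a single shared cycle, while `method_28(4, 3, 6)` yields two consecutive cycles.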

[0031] FIG. 4 illustrates two threads being executed in the bandwidth of an in-order multithreading processor in accordance with one embodiment of the present invention. As shown, an in-order multi-threading processor usually has enough width to execute threads 1 and 2 in parallel. This is because the compiler can usually find instructions from one thread that use only half of the machine. The individual instructions in threads 1 and 2 are called syllables 40 and 42, which are organized cycle by cycle based on whether a particular syllable 40 is dependent upon the result of another. Using method 28 as described above, the threads are analyzed to determine if syllables 40 from thread 1 and syllables 42 from thread 2 fit in the width of the in-order multi-threading processor. If the syllables from both threads cannot be executed in parallel, then each thread must be executed in series.

[0032] Referring to FIG. 4, lines A, B, C, D, and G illustrate examples of an in-order multi-threading processor executing threads 1 and 2 in parallel. However, in line E, thread 1 included four syllables 40 and thread 2 included three syllables 42, which together proved to be too wide for the in-order multi-threading processor. The syllables 42 for thread 2 were then deferred until the next cycle, represented by line F. Then in line G, the in-order multi-threading processor was again wide enough to execute both threads 1 and 2, so parallel operation resumed.
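
The sequence of lines A through G can be reproduced with a small trace, sketched here under several assumptions: the processor width of six syllables and every syllable count other than the four and three of line E are hypothetical, as is the function name `trace`. Thread 1 is given the first cycle when the two threads do not fit together, which matches the deferral of thread 2 to line F.

```python
def trace(thread1, thread2, processor_width):
    """Return one (thread1_syllables, thread2_syllables) tuple per cycle.

    thread1 and thread2 list the syllable count of each successive
    instruction group of the respective thread.
    """
    for width in thread1 + thread2:
        if not 0 < width <= processor_width:
            raise ValueError("each group must fit in the processor on its own")
    cycles = []
    for w1, w2 in zip(thread1, thread2):
        if w1 + w2 <= processor_width:
            cycles.append((w1, w2))  # both threads issue in parallel
        else:
            cycles.append((w1, 0))   # thread 1 issues alone
            cycles.append((0, w2))   # thread 2 is deferred one cycle
    # If one thread has more groups than the other, the rest run alone.
    n = min(len(thread1), len(thread2))
    cycles += [(w, 0) for w in thread1[n:]]
    cycles += [(0, w) for w in thread2[n:]]
    return cycles


# Hypothetical counts; only the fifth pair (4 and 3) comes from line E.
lines = dict(zip("ABCDEFG", trace([3, 2, 3, 2, 4, 3], [2, 3, 3, 2, 3, 2], 6)))
```

With these inputs, `lines["E"]` is `(4, 0)` and `lines["F"]` is `(0, 3)`, while every other line carries syllables from both threads.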

[0033] While multi-thread scheduler 24 in FIG. 2 is programmed to schedule only two threads for processing, it is well known in the art that a multi-thread scheduler may be configured to schedule additional threads. Each additional thread being scheduled by in-order multi-threading processor 12 would also require a corresponding instruction fetch and instruction decode unit. While such a system would be able to process more than two threads simultaneously, each additional thread requires an exponential increase in machine cost, such as a large parallel logic array (PLA), to coordinate and schedule all of the threads.

[0034] Therefore, adding additional threads to the present invention could eliminate the advantage of multi-threading. In fact, the per-thread cost is even larger for an in-order machine than for an out-of-order machine. With the current configuration of in-order multi-threading processor 12, it is not an efficient use of processing power to execute more than two threads at the same time, particularly because the processor is not currently wide enough to support more than two threads in parallel.

[0035] In summary, the present invention provides for an apparatus and method for scheduling multiple threads for a simultaneous multi-threading in-order processor. Despite the fact that out of order dynamic machines have the advantage of possessing an existing structure to schedule threads and create independent chains of execution in a multi-threading processor, in-order static machines possess many desirable architectural characteristics, such as simplicity of design. In-order machines are also easier to design than out of order machines because in-order machines are less complex.

[0036] Another advantage of an in-order machine is the conservation of space and power. Although out of order machines offer additional features in return for the additional design effort, the complexity of the architecture is a disadvantage because it requires much more of the limited space on a semiconductor chip. As microprocessor speeds continue to increase, the number of transistors that must fit into a semiconductor chip die must also increase, a process that could lead to overheating. The present invention therefore not only provides for utilizing an in-order machine for multi-threading processes, but also for conserving power and chip space, allowing much more flexibility for future microprocessor designs.

[0037] Other embodiments of the invention will be appreciated by those skilled in the art from consideration of the specification and practice of the invention. Furthermore, certain terminology has been used for the purposes of descriptive clarity, and not to limit the present invention. The embodiments and preferred features described above should be considered exemplary, with the invention being defined by the appended claims.

What is claimed is:
 1. A multi-threading processor, comprising: a first instruction fetch unit to receive a first thread and a second instruction fetch unit to receive a second thread; an execution unit to execute said first thread and said second thread; and a multi-thread scheduler coupled to said first instruction fetch unit, said second instruction fetch unit, and said execution unit, wherein said multi-thread scheduler is to determine whether the width of said execution unit.
 2. A multi-threading processor as recited in claim 1, wherein the multi-thread scheduler unit determines whether the execution unit is to execute the first thread and the second thread in parallel depending on the width of the execution unit.
 3. A multi-threading processor as recited in claim 2, wherein the multi-thread processor is an in-order processor.
 4. A multi-threading processor as recited in claim 3, wherein the execution unit executes the first thread and the second thread in parallel.
 5. A multi-threading processor as recited in claim 3, wherein the execution unit executes the first thread and the second thread in series.
 6. A multi-threading processor as recited in claim 3, wherein the first thread and the second thread are compiled to have instruction level parallelism.
 7. A multi-threading processor as recited in claim 6, further comprising: a first instruction decode unit coupled between the first instruction fetch unit and the multi-thread scheduler; and a second instruction decode unit coupled between the second instruction fetch unit and the multi-thread scheduler.
 8. A multi-threading processor as recited in claim 4, wherein the execution unit executes only two threads in parallel.
 9. A method for scheduling threads in a multi-threading processor, comprising: determining whether said multi-threading processor is wide enough to execute a first thread and a second thread in parallel; and executing said first thread and said second thread in parallel if said multi-threading processor is wide enough to execute the first thread and the second thread in parallel.
 10. A method for scheduling threads as recited in claim 9, further comprising executing the first thread and the second thread in series if said multi-threading processor is not wide enough.
 11. A method for scheduling threads as recited in claim 10, wherein the multi-threading processor is an in-order processor.
 12. A method for scheduling threads as recited in claim 11, further comprising compiling the first thread and the second thread, wherein the first thread and the second thread have instruction level parallelism.
 13. A method for scheduling threads as recited in claim 12, wherein the multi-threading processor executes only two threads in parallel.
 14. A method for scheduling threads as recited in claim 13, further comprising: fetching the first thread and the second thread; and decoding the first thread and the second thread.
 15. A set of instructions residing in a storage medium, said set of instructions capable of being executed by a processor for searching data stored in a mass storage device comprising: determining whether said multi-threading processor is wide enough to execute a first thread and a second thread in parallel; and executing said first thread and said second thread in parallel if said multi-threading processor is wide enough to execute the first thread and the second thread in parallel.
 16. A method for scheduling threads as recited in claim 15, further comprising executing the first thread and the second thread in series if said multi-threading processor is not wide enough.
 17. A method for scheduling threads as recited in claim 16, wherein the multithreading processor is an in-order processor.
 18. A method for scheduling threads as recited in claim 17, further comprising compiling the first thread and the second thread, wherein the first thread and the second thread have instruction level parallelism.
 19. A method for scheduling threads as recited in claim 18, wherein the multithreading processor executes only two threads in parallel.
 20. A method for scheduling threads as recited in claim 19, further comprising: fetching the first thread and the second thread; and decoding the first thread and the second thread.